Analysis of fairness metrics with missing data (work in progress)
Description:
The prediction task is to determine whether a person makes over 50K a year. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
Cite:
Ron Kohavi, “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid”, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996
Fairness Analysis:
For protected attribute sex, Male is privileged, and Female is unprivileged. For protected attribute race, White is privileged, and Non-white is unprivileged. Favorable label is High income (> 50K) and unfavorable label is Low income (<= 50K).
Missing Values:
Variables sorted by proportion of missing values:
Variable Proportion
occupation 0.05660146
workclass 0.05638647
native.country 0.01790486
age 0.00000000
fnlwgt 0.00000000
education 0.00000000
education.num 0.00000000
marital.status 0.00000000
relationship 0.00000000
race 0.00000000
sex 0.00000000
capital.gain 0.00000000
capital.loss 0.00000000
hours.per.week 0.00000000
income.per.year 0.00000000
Number of missing values per variable:
Variable Count
workclass 1836
occupation 1843
native.country 583
Description:
The Kaggle Titanic dataset, describing the survival status of individual passengers on the Titanic. The Titanic data does not contain information about the crew, but it does contain actual ages for about half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994), Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay. For more information about how this dataset was constructed, see: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt
Cite:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html
Fairness Analysis:
For protected attribute sex, Female is privileged, and Male is unprivileged. For protected attribute pclass (a proxy for socio-economic class), 1st class is privileged, and 2nd and 3rd class are unprivileged. Favorable label is survived (survived = TRUE) and unfavorable label is died (survived = FALSE).
Missing Values:
Variables sorted by proportion of missing values:
Variable Proportion
age 0.2009167303
embarked 0.0015278839
fare 0.0007639419
pclass 0.0000000000
survived 0.0000000000
name 0.0000000000
sex 0.0000000000
sibsp 0.0000000000
parch 0.0000000000
ticket 0.0000000000
Number of missing values per variable:
Variable Count
age 263
fare 1
embarked 2
Description:
Data on educational transitions for a sample of 500 Irish schoolchildren aged 11 in 1967. The data were collected by Greaney and Kelleghan (1984), and reanalyzed by Raftery and Hout (1985, 1993).
Cite:
http://lib.stat.cmu.edu/datasets/irish.ed
Fairness Analysis:
For protected attribute sex, Male is privileged, and Female is unprivileged (Irish_1 version). For protected attribute sex, Female is privileged, and Male is unprivileged (Irish_2 version). In both versions, favorable label is Leaving Certificate taken (1) and unfavorable label is Leaving Certificate not taken (2).
Missing Values:
Variables sorted by proportion of missing values:
Variable Proportion
Prestige_score 0.052
Educational_level 0.012
Sex 0.000
DVRT 0.000
Leaving_Certificate 0.000
Type_school 0.000
Number of missing values per variable:
Variable Count
Educational_level 6
Prestige_score 26
No imputation (none): original dataset (AIF360 removes rows with NAs).
No imputation Clean (none_clean): subset of the original dataset containing rows without NAs (equivalent to “none”).
No imputation NAs (none_NA): subset of the original dataset containing only rows with NAs.
Remove Columns (Col): Remove columns containing missing values.
Minimum (Min): Replace missing values with the minimum of the observed values (library Hmisc).
Mean or Mode (Mean/Mode): Replace missing values with the mean (if numeric) or mode (if categorical) of the observed values (library Hmisc).
Random (Random): Draw random values for imputation, with the random values not forced to be the same if there are multiple NAs (library Hmisc).
Sample (Sample): Random sample from observed values. (library MICE)
Predictive Mean Matching (PMM): Imputation of y by predictive mean matching, based on van Buuren (2012, p. 73). For each missing entry, it finds an observed case whose predicted mean is closest to the predicted mean of the missing entry; the observed value from this “match” is then used as the imputed value. (library MICE)
Classification and regression trees (CART): Imputation of y by classification and regression trees. The procedure is as follows: (1) Fit a classification or regression tree by recursive partitioning; (2) For each ymis, find the terminal node it ends up in according to the fitted tree; (3) Make a random draw among the members of that node, and take the observed value from that draw as the imputation. (library MICE)
Random Forest (RF): Imputation of missing values particularly in the case of mixed-type data. It uses a random forest trained on the observed values of a data matrix to predict the missing values. It can be used to impute continuous and/or categorical data including complex interactions and non-linear relations. (library missForest).
Best (Best): Best combination of imputation methods depending on the variable type: PMM for numeric variables; logistic regression for binary variables (2 levels); Bayesian polytomous regression for unordered factor variables (> 2 levels); and the proportional odds model for ordered variables (> 2 levels). (library MICE)
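As an illustration of the simplest strategy above, here is a minimal pure-Python sketch of Mean/Mode imputation (the function name and the list-with-None representation are ours, not Hmisc's, which operates on R data frames):

```python
from statistics import mean, mode

def impute_mean_mode(column, is_numeric):
    """Fill None entries with the mean (numeric) or the mode
    (categorical) of the observed values, as in Mean/Mode."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if is_numeric else mode(observed)
    return [fill if v is None else v for v in column]

print(impute_mean_mode([1.0, None, 3.0], True))        # -> [1.0, 2.0, 3.0]
print(impute_mean_mode(["a", "b", None, "b"], False))  # -> ['a', 'b', 'b', 'b']
```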
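Predictive mean matching can be sketched similarly. This toy version is our own simplification: one numeric predictor and plain least squares, whereas MICE draws the regression coefficients from their posterior and samples among several close donors:

```python
def pmm_impute(x, y):
    """Toy PMM: fit y ~ x on complete cases, then for each missing y
    take the observed y of the donor whose fitted value is closest
    to the fitted value of the missing case."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = sum(p[0] for p in obs) / len(obs)
    my = sum(p[1] for p in obs) / len(obs)
    b = (sum((p[0] - mx) * (p[1] - my) for p in obs)
         / sum((p[0] - mx) ** 2 for p in obs))
    a = my - b * mx
    out = []
    for xi, yi in zip(x, y):
        if yi is not None:
            out.append(yi)
        else:
            pred = a + b * xi
            donor = min(obs, key=lambda p: abs((a + b * p[0]) - pred))
            out.append(donor[1])
    return out

# The missing case (x=3) has fitted value ~6; the closest donor is (2, 4).
print(pmm_impute([1, 2, 3, 10], [2, 4, None, 20]))  # -> [2, 4, 4, 20]
```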
Fairness pipeline followed (from AIF360). An example instantiation of this generic pipeline consists of loading data into a dataset object, transforming it into a fairer dataset using a fair pre-processing algorithm, learning a classifier from this transformed dataset, and obtaining predictions from this classifier. Metrics can be calculated on the original, transformed, and predicted datasets as well as between the transformed and predicted datasets. Many other instantiations are also possible.
Figure 1: Figure from https://github.com/IBM/AIF360
Metrics:
Mean Difference (MD): Computed as the difference between the rate of favorable outcomes received by the unprivileged group and that received by the privileged group. The ideal value of this metric is 0.0.
Statistical Parity Difference (SPD): This is the difference in the probability of favorable outcomes between the unprivileged and privileged groups. This can be computed both from the input dataset as well as from the dataset output from a classifier (predicted dataset). A value of 0 implies both groups have equal benefit, a value less than 0 implies higher benefit for the privileged group, and a value greater than 0 implies higher benefit for the unprivileged group.
Disparate Impact (DI): This is the ratio in the probability of favorable outcomes between the unprivileged and privileged groups. This can be computed both from the input dataset as well as from the dataset output from a classifier (predicted dataset). A value of 1 implies both groups have equal benefit, a value less than 1 implies higher benefit for the privileged group, and a value greater than 1 implies higher benefit for the unprivileged group.
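These dataset-level metrics reduce to simple rate arithmetic. A minimal sketch, using plain Python lists in place of AIF360's dataset objects (labels: 1 = favorable; the group names are our own placeholders):

```python
def favorable_rate(labels, groups, group):
    """P(favorable outcome | group): fraction of `group` members
    with the favorable label (1)."""
    members = [l for l, g in zip(labels, groups) if g == group]
    return sum(members) / len(members)

def statistical_parity_difference(labels, groups):
    # P(fav | unprivileged) - P(fav | privileged); ideal value 0
    return (favorable_rate(labels, groups, "unpriv")
            - favorable_rate(labels, groups, "priv"))

def disparate_impact(labels, groups):
    # P(fav | unprivileged) / P(fav | privileged); ideal value 1
    return (favorable_rate(labels, groups, "unpriv")
            / favorable_rate(labels, groups, "priv"))

labels = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["priv"] * 4 + ["unpriv"] * 4
print(statistical_parity_difference(labels, groups))  # 0.25 - 0.75 = -0.5
print(disparate_impact(labels, groups))               # 0.25 / 0.75 ~ 0.333
```

Both values here favor the privileged group (SPD < 0, DI < 1), matching the interpretation above.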
Results:
Analysis:
We consider the imputed dataset “Col” (where we remove columns containing missing values) as the gold standard against which we compare the results obtained from the other imputed datasets.
Techniques:
Results:
Analysis:
We consider the imputed dataset “Col” (where we remove columns containing missing values) as the gold standard against which we compare the results obtained from the other imputed datasets.
Metrics:
Statistical Parity Difference (SPD): as defined above.
Disparate Impact (DI): as defined above.
Average odds difference (OddsDif): This is the average of difference in false positive rates and true positive rates between unprivileged and privileged groups. This is to be computed from the dataset output from a classifier and hence needs to be computed using the input and output datasets to a classifier. A value of 0 implies both groups have equal benefit, a value less than 0 implies higher benefit for the privileged group and a value greater than 0 implies higher benefit for the unprivileged group.
Equal opportunity difference (EOD): This is the difference in true positive rates between unprivileged and privileged groups. This is to be computed from the dataset output from a classifier and hence needs to be computed using the input and output datasets to a classifier. A value of 0 implies both groups have equal benefit, a value less than 0 implies higher benefit for the privileged group and a value greater than 0 implies higher benefit for the unprivileged group.
Theil Index (TI): The Theil index T is the same as redundancy in information theory, which is the maximum possible entropy of the data minus the observed entropy. It is a special case of the generalized entropy index. It can be viewed as a measure of redundancy, lack of diversity, isolation, segregation, inequality, non-randomness, and compressibility. The numerical result is in terms of negative entropy, so a higher number indicates more order, further away from the “ideal” of maximum disorder. Formulating the index to represent negative entropy instead of entropy allows it to be a measure of inequality rather than equality.
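The classifier-based metrics above can be sketched the same way (our own plain-Python version; y are true labels, yhat the classifier's predictions, 1 = favorable; for the Theil index we use AIF360's benefit function b_i = yhat_i - y_i + 1):

```python
from math import log

def group_rates(y, yhat, groups, group):
    """Return (TPR, FPR) for one group."""
    tp = sum(1 for a, p, g in zip(y, yhat, groups) if g == group and a == 1 and p == 1)
    pos = sum(1 for a, g in zip(y, groups) if g == group and a == 1)
    fp = sum(1 for a, p, g in zip(y, yhat, groups) if g == group and a == 0 and p == 1)
    neg = sum(1 for a, g in zip(y, groups) if g == group and a == 0)
    return tp / pos, fp / neg

def equal_opportunity_difference(y, yhat, groups):
    # TPR(unprivileged) - TPR(privileged); ideal value 0
    return (group_rates(y, yhat, groups, "unpriv")[0]
            - group_rates(y, yhat, groups, "priv")[0])

def average_odds_difference(y, yhat, groups):
    # mean of the FPR and TPR gaps between groups; ideal value 0
    tpr_u, fpr_u = group_rates(y, yhat, groups, "unpriv")
    tpr_p, fpr_p = group_rates(y, yhat, groups, "priv")
    return ((fpr_u - fpr_p) + (tpr_u - tpr_p)) / 2

def theil_index(y, yhat):
    # Generalized entropy (alpha=1) over benefits b_i = yhat_i - y_i + 1
    b = [p - a + 1 for a, p in zip(y, yhat)]
    mu = sum(b) / len(b)
    return sum((bi / mu) * log(bi / mu) for bi in b if bi > 0) / len(b)

y     = [1, 1, 0, 0, 1, 1, 0, 0]
yhat  = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["priv"] * 4 + ["unpriv"] * 4
print(equal_opportunity_difference(y, yhat, groups))  # 0.5 - 1.0 = -0.5
print(average_odds_difference(y, yhat, groups))       # ((0-0.5)+(0.5-1))/2 = -0.5
print(theil_index(y, yhat))                           # 2*ln(2)/8 ~ 0.173
```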
Techniques:
Logistic Regression (LR): From sklearn, Logistic Regression (aka logit, MaxEnt) classifier. Default parameters.
Neural Network (AD-disabled, NN-biased): From TensorFlow; AIF360's adversarial debiasing network with the debiasing component disabled, i.e., a plain (potentially biased) neural network classifier.
Random Forest (RF): From sklearn, a random forest classifier.
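For concreteness, the two sklearn models instantiated on a toy dataset (the synthetic data and the fixed random_state for RF are our own; the actual experiments use the datasets described above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy, linearly separable data standing in for the real feature matrices.
X = [[0.0], [1.0], [2.0], [3.0], [8.0], [9.0], [10.0], [11.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

lr = LogisticRegression().fit(X, y)                    # LR, default parameters
rf = RandomForestClassifier(random_state=0).fit(X, y)  # RF

print(lr.predict([[1.0], [10.0]]))  # -> [0 1]
print(rf.predict([[1.0], [10.0]]))  # -> [0 1]
```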
Fairness-aware Techniques:
Results:
Figure 2: Adult dataset
Figure 3: Titanic dataset
Figure 4: Irish_1 dataset
Figure 5: Irish_2 dataset
Figure 6: Recidivism dataset (1 seed)
Figure 7: Violent Recidivism dataset (1 seed)
For attribution, please cite this work as
Nieves-Cordones & Martinez-Plumed (2019, Jan. 7). Missing Fairness. Retrieved from https://nandomp.github.io/R/missingFairness.html
BibTeX citation
@misc{nieves2019missingfairness,
author = {Nieves-Cordones, David and Martinez-Plumed, Fernando},
title = {Missing Fairness},
url = {https://nandomp.github.io/R/missingFairness.html},
year = {2019}
}